1. INTRODUCTION

Web 2.0 has given rise to numerous platforms and tools that allow internet users to express their
viewpoints and ideas on various topics and events. Unfortunately, some individuals misuse these
platforms to propagate hate speech and offensive content, which harms the mental well-being of the
online community [1], [2]. According to a 2021 Pew Research Center survey, 41% of Americans
have experienced online harassment, including offensive name-calling, intentional embarrassment, physical
threats, stalking, and sexual harassment. Additionally, the Cyberbullying Research Center reports that over
30% of teenagers in the United States have endured some form of cyberbullying, including hurtful
comments, rumor spreading, and threats.

Therefore, the detection of offensive language has become an active research task in natural 
language processing (NLP). Offensive language can be defined as text that uses abusive slurs or derogatory 
terms [3]. Different forms of offensive language include hate speech, aggressive content, cyberbullying, and 
toxic comments. Many workshops and shared tasks have been conducted to encourage research in this field 
from various perspectives [4]-[7]. 

Despite the crucial importance of this task, limited work has been done in non-English languages
such as Arabic [8], [9]. Arabic ranks 4th among the most commonly used languages on the
internet. However, the ambiguity and informality of written Arabic make the
classification of Arabic social media content a very difficult task [10]. Additionally, the Arabic language has
multiple varieties with different vocabularies and structures, which makes it hard to achieve high classification
performance.

Existing studies in Arabic offensive language detection have mainly adopted machine learning and
deep learning based approaches. Alakrot et al. [11] employed a support vector machine (SVM) classifier and
experimented with different word-level features, N-gram features, and various pre-processing techniques to
detect offensive language on an Arabic dataset. They also concluded that it is not beneficial to use both
stemming and N-gram features within the same machine learning process. The methodology used in
Shannaq et al. [12] consisted of two stages of optimization. During the initial phase, the training dataset
was employed to fine-tune multiple word embedding models for the extraction of word characteristics from
the ArCybC corpus. In the second stage, a hybrid approach, combining a genetic algorithm (GA) with either
SVM or eXtreme gradient boosting (XGBoost), was employed to enhance the model's performance.
Abuzayed and Elsayed [13] conducted a comparative analysis of 15 classical and neural learning models,
using two different word representations: tf-idf and word embeddings. The experimental results indicated
that the tf-idf representation yielded better results than word embeddings in classical models. However,
the most effective neural learning model, a joint convolutional neural network (CNN) and long short-term
memory (LSTM) architecture, outperformed all classical models.

Recently, transformer-based models, such as the bidirectional encoder representations from transformers
(BERT) model, have improved the results of many tasks [14]-[18], including offensive language detection.
The work of Husain and Uzuner [19] investigated the impact of various preprocessing techniques on Arabic 
offensive language classification. Additionally, different models were examined including traditional machine 
learning, ensemble machine learning, artificial neural networks, and BERT-based models. The experimental 
results showed that the BERT-based models outperformed all other models. Moreover, the
findings of this study indicate that preprocessing yields limited gains for BERT-based classifiers in text
classification pipelines, suggesting it can be omitted. Althobaiti [20] compared BERT with conventional
machine learning techniques like SVM and logistic regression for handling this task. Additionally, they 
explored using sentiments and textual descriptions of emojis as additional features in the dataset. The 
experiments demonstrated that the BERT-based model outperformed all other examined models. Another 
study of El-Alami et al. [21] proposed an effective approach for multilingual offensive language detection 
(MOLD) using transfer learning based on BERT. The system comprises three stages: preprocessing, text 
representation with BERT, and classification into offensive and non-offensive categories. To address 
multilingualism, they investigated both joint-multilingual and translation-based
techniques. They obtained promising results with the translation-based method using the Arabic BERT model
(AraBERT) by achieving over 93% F1-score and 91% accuracy on a bilingual dataset composed of English 
and Arabic reviews. The authors affirmed the robustness of BERT-based models in the MOLD field. 

Unlike most existing work in this field, our study aims to provide an enhanced model for Arabic
offensive language detection without relying on tedious preprocessing or feature engineering tasks.
Additionally, we investigate combining the BERT model with a bidirectional gated recurrent unit (BiGRU) 
layer to further improve the understanding of the context and relationships between words. Moreover, various 
Arabic BERT models are examined in this paper to select the most suitable one for this task. The main 
contributions of this study can be summarized as follows: 

— We propose an enhanced model for Arabic offensive language detection without the need for hand-crafted 
features or external linguistic resources, like lexicons. 

— We combine BERT with a BiGRU layer to enhance the extracted semantic and contextual features. 
As far as we know, no prior work has utilized this combination to perform the offensive language 
detection task in Arabic. 

— Extensive experiments on the Arabic SemEval 2020 dataset show the effectiveness of the proposed model 
in comparison to the baseline and related work models. 

The remainder of this paper is structured as follows: Section 2 reviews related work on Arabic
offensive language detection. Section 3 describes the research methodology. Section 4 presents the
experimental setup. Section 5 discusses the experimental results. Finally, Section 6 concludes the
paper and outlines directions for future research.


2. RELATED WORK

Compared to the amount of work done in English, only a few studies have been conducted on 
detecting offensive language in Arabic [8], [9]. One of the earlier studies in this area was conducted by 
Mubarak et al. [22]. The authors built a list of 288 Arabic obscene words and another list of 127 hashtags. 
They then used this list along with additional patterns to gather Arabic abusive tweets from the Twitter API 
in 2014. These tweets were classified into two categories: tweets that did not contain any obscene word from 
the list of seed words, and those that included at least one of the words in the list. 

In Alakrot et al. [11], comments from YouTube were collected and manually labeled by three
annotators as either offensive or non-offensive. The authors trained an SVM classifier with different combinations
of word-level features, N-gram features, and various pre-processing techniques. They achieved an F1-score
of 82% using pre-processing combined with stemming.

Mohaouchane et al. [23] sought to enhance the previous results by using Word2Vec embeddings
with different neural network models, including CNN, bidirectional long short-term memory (BiLSTM), and
BiLSTM with attention. The CNN model outperformed the other models, achieving the highest accuracy of
87.84% and an F1-score of 84.05%.

In 2020, a shared task was conducted by the SemEval workshop [24] that targeted the offensive 
language detection task. It provided labeled datasets for many languages, including Arabic. The team of 
Alami et al. [25] ranked first in this competition for the Arabic language. The authors used AraBERT to 
encode the Arabic tweets, followed by a sigmoid layer for classification. They also examined the impact of 
translating the meaning of emojis on the overall performance of the proposed model. They achieved a macro
F1-score of 90.17%.

Hassan et al. [26] attained the second rank by combining CNN-BiLSTM, SVM, and multilingual
BERT. The SVM classifier employed character n-grams, word n-grams, and word embeddings as features, 
whereas the CNN-BiLSTM model learned character embeddings and additionally employed pre-trained word 
embeddings as input. Their performance yielded a macro F1-score of 90.16%. 

Wang et al. [27] ranked third for Arabic. They proposed a unified approach to detect offensive
language in all languages, including Arabic. To this end, they used the XLM-R model, which was pre-trained
to learn all the language representations together. They then fed the output of the [CLS] token of the top layer of
XLM-R into a fully connected layer, using the same parameters for all languages. The proposed model
achieved a macro F1-score of 89.89%.

Safaya et al. [28] attained the fourth rank for Arabic. They combined the AraBERT model with a 
CNN layer to handle this task. The output of the last four hidden layers was fed into several filters and 
convolution layers of the CNN. Then, the output of CNN was fed into a dense layer with a sigmoid activation 
function for classification. They reported a macro F1-score of 89.72%. 

Another shared task was conducted in 2022 [29] and was divided into three subtasks: i) identify 
whether a tweet is offensive or not; ii) determine whether a tweet contains hate speech or not; and iii) 
determine the fine-grained type of hate speech (disability, social class, race, religion, ideology, and gender). 

The team of Mostafa et al. [30] ranked first in subtask A. Seven language models were examined in 
this paper. Moreover, an ensemble learning approach was used to further enhance the model performance. 
In addition, different loss functions were evaluated to address the data imbalance problem. The best results
(macro F1-score=85.2%) were achieved using a majority voting technique between three models: i) QARiB
trained using Dice loss, ii) MARBERT trained using VS loss, and iii) MARBERTv2 trained using focal
loss with label smoothing.

AlKhamissi and Diab [31] achieved second place by proposing a multi-task learning approach to 
handle all three sub-tasks simultaneously. They first encoded input tweets using the fine-tuned MARBERT 
model, and then passed the output embedding to three task-specific classifiers. Each classifier consisted of a 
multilayered feedforward neural network with layer normalization. Their method achieved a macro F1-score 
of 84.5% in subtask A. 


3. METHOD 
3.1. Task description 

The objective of this study is to classify every text into one of two distinct classes: offensive or
non-offensive. The task is therefore treated as a binary classification problem in which texts are
distinguished solely by whether their content is offensive.


3.2. Model overview 
The overall architecture of the proposed model is illustrated in Figure 1. First, a BERT layer is used
to generate the vector representations of the text input, followed by a BiGRU layer to further extract contextual
and semantic features. A fully connected dense layer with a sigmoid activation function is then used to classify
the text into one of the two classes.


[Diagram with blocks labeled Forward GRU, Backward GRU, Bidirectional GRU, and Output]


Figure 1. Overall architecture of the proposed model 


3.2.1. Bidirectional encoder representations from transformers model 

BERT [32] is a pre-trained language model built on the transformer, an attention-based
architecture that employs an encoder to read the input text and a decoder to generate a prediction for the
task. BERT uses only the encoder part to provide a language representation model. In addition, BERT makes
the training bidirectional by considering context from both the left and right directions across all layers.

Moreover, the pre-training process of BERT involves two unsupervised tasks: masked language
modeling (MLM) and next sentence prediction (NSP). For the first task, BERT randomly masks a portion of
the input tokens and subsequently attempts to predict those hidden tokens. In the second task, the model
predicts whether one sentence follows another in a given sequence of sentences. The BERT model has
improved the results of many NLP tasks, including named entity recognition [33], [34], text classification [35],
[36], and sentiment analysis [37], [38]. Figure 2 illustrates the architecture of the BERT model.


Figure 2. The architecture of BERT model [32] 
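
The masked language modeling objective described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions: the function name `mask_tokens`, the whitespace tokenization, and the 15% default masking rate are illustrative choices (the text above only says "a portion"), and every selected token is replaced by `[MASK]`, whereas actual BERT pre-training also substitutes random or unchanged tokens for part of the selection.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol used for hidden tokens


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random portion of tokens.

    Returns (masked_tokens, labels). labels holds the original token at
    each masked position and None elsewhere, so a model would only be
    trained to predict the hidden positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Illustrative input; whitespace splitting stands in for a real tokenizer.
tokens = "the model predicts the hidden words from context".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```
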


There are two types of BERT-based models for the Arabic language: monolingual models and
multilingual models. In the first type, the BERT architecture is pre-trained on Arabic content only. This content
can be written in classical Arabic, modern standard Arabic (MSA), or dialectal Arabic (DA). In the
second type, the BERT model is pre-trained on multilingual content, including Arabic. Table 1 describes the
main Arabic BERT models that are publicly available to the research community.


Table 1. Main publicly available Arabic BERT models
Model | Type of Arabic | Source | Size of pre-trained dataset
AraBERTv02 [39] | MSA | Various Arabic corpora like El-Khair [40] and OSIAN [41] | 8.6B tokens
AraBERTv02-twitter | MSA+DA | The same dataset as AraBERTv02 in addition to Arabic tweets | 8.6B tokens + 16M tweets
MARBERTv2 [42] | MSA+DA | MSA corpora such as OSCAR [43] and OSIAN [41] in addition to Arabic tweets | 29B tokens
Qarib [44] | MSA+DA | News and movie/TV subtitles for MSA; tweets for the dialectal text | 14B tokens
CamelBERT [45] | DA | A range of dialectal corpora like NADI [46] and QADI [47] | 5.6B tokens
mBERT [32] | MSA | Wikipedia | 7292 tokens


In this study, the BERT model is fine-tuned to learn specific knowledge relevant to the downstream 
task. Additionally, we employed the final hidden state vector of the special token [CLS] as the representation 
of the entire input sequence. The output of the BERT model can be represented as (1): 


$$x = h_{[CLS]} \in \mathbb{R}^{d} \tag{1}$$

Where the dimension $d$ is equal to 768.


3.2.2. Bidirectional gated recurrent unit layer 

GRU is a variant of the recurrent neural network (RNN) that was created to tackle the issue of
long-term dependencies and the vanishing gradient problem. Its structure is simpler than LSTM, as it combines the
input and forget gates into a single update gate and merges the hidden and cell states into a single hidden
state, as depicted in Figure 3. The update gate, denoted as $z_t$, regulates the amount of past information that
should be transmitted to the next state. On the other hand, the reset gate, denoted as $r_t$, controls the amount
of previous information that should be disregarded. The calculation formulas are provided as (2)-(5):


$$z_t = \sigma(W_{zx} x_t + U_{zh} h_{t-1}) \tag{2}$$
$$r_t = \sigma(W_{rx} x_t + U_{rh} h_{t-1}) \tag{3}$$
$$c_t = \tanh(W_{cx} x_t + r_t \odot U_{ch} h_{t-1}) \tag{4}$$
$$h_t = (1 - z_t) \odot c_t + z_t \odot h_{t-1} \tag{5}$$


Where $\sigma$ represents the sigmoid function and $\odot$ denotes the element-wise product; $c_t$ is the
candidate state and $h_t$ is the hidden state at time $t$. The weight matrices W and U are learned during
training. Since a GRU network processes a sequence in one direction only, from front to back, we
employed a BiGRU layer to process the data from both directions and generate complete contextual features.
The output of the hidden layer $g_t$ at time $t$ is the concatenation of the forward and backward states as (6):


$$g_t = [\overrightarrow{h_t} \oplus \overleftarrow{h_t}] \tag{6}$$


Figure 3. The architecture of GRU 

The output of BiGRU can be represented as (7):


$$g = \{g_1, g_2, \ldots, g_n\} \tag{7}$$


We then use a fully connected dense layer with a sigmoid function to generate the final predictions, which
classify the input text as offensive or not offensive.


$$\hat{y} = \mathrm{Sigmoid}(V g + b) \tag{8}$$


Here, $\hat{y}$ represents the predicted probabilities, V is a weight matrix learned during training, and b
is a bias term.
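
The recurrence in (2)-(5), the bidirectional concatenation in (6), and the sigmoid output in (8) can be traced with a small, dependency-free sketch. It is an illustration under stated assumptions rather than the implementation used in this study: the dimensions, the random weights, and all function names are hypothetical, and the sequence is summarized by concatenating the last forward state with the backward state at the first position, which is one common convention that the text above does not specify.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_params(input_dim, hidden_dim, rng):
    """Small random weights for one GRU direction (illustrative only)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "Wz": mat(hidden_dim, input_dim), "Uz": mat(hidden_dim, hidden_dim),
        "Wr": mat(hidden_dim, input_dim), "Ur": mat(hidden_dim, hidden_dim),
        "Wc": mat(hidden_dim, input_dim), "Uc": mat(hidden_dim, hidden_dim),
    }


def gru_step(x_t, h_prev, p):
    """One GRU update following equations (2)-(5)."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev))]  # (2)
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev))]  # (3)
    c = [math.tanh(a + r_i * b)                                                          # (4)
         for a, r_i, b in zip(matvec(p["Wc"], x_t), r, matvec(p["Uc"], h_prev))]
    return [(1.0 - z_i) * c_i + z_i * h_i for z_i, c_i, h_i in zip(z, c, h_prev)]        # (5)


def run_gru(inputs, p, hidden_dim):
    """Run a GRU over a sequence, starting from a zero hidden state."""
    h = [0.0] * hidden_dim
    states = []
    for x_t in inputs:
        h = gru_step(x_t, h, p)
        states.append(h)
    return states


def bigru(inputs, p_fwd, p_bwd, hidden_dim):
    """Concatenate forward and backward states at every position, as in (6)."""
    fwd = run_gru(inputs, p_fwd, hidden_dim)
    bwd = run_gru(inputs[::-1], p_bwd, hidden_dim)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def classify(features, V, b):
    """Sigmoid output layer, as in (8), for a single output unit."""
    return sigmoid(sum(v * g for v, g in zip(V, features)) + b)


rng = random.Random(0)
input_dim, hidden_dim, seq_len = 4, 3, 5  # toy sizes, not the paper's settings
inputs = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(seq_len)]
p_fwd = init_params(input_dim, hidden_dim, rng)
p_bwd = init_params(input_dim, hidden_dim, rng)

g = bigru(inputs, p_fwd, p_bwd, hidden_dim)
# Assumption: last forward state + backward state at position 0 as the summary.
features = g[-1][:hidden_dim] + g[0][hidden_dim:]
V = [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden_dim)]
y_hat = classify(features, V, 0.0)
print(len(g), len(g[0]), y_hat)
```

Each position of `g` has twice the hidden size because the two directions are concatenated, which is why the output weights `V` have `2 * hidden_dim` entries.
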


4. EXPERIMENTS 
4.1. Dataset 

The dataset used in this study was released by SemEval 2020 task 12 [24]. It comprises 
10,000 tweets gathered during the period of April to May 2019 using the Twitter API and annotated manually 
as either offensive or non-offensive. More details about the dataset can be found in Mubarak et al. [48]. 
Table 2 illustrates the distribution of the data in terms of training and testing sets. 


Table 2. The size distribution of the dataset
Split | Offensive | Not offensive | Total
Train | 1589 | 6411 | 8000
Test | 402 | 1598 | 2000


4.2. Experimental settings 

The proposed model was implemented in Python using the TensorFlow and Keras libraries. For the
BERT model, we used the base version, which contains 12 transformer layers with 12 self-attention heads and
a hidden size of 768. Additionally, we used a maximum sequence length of 128, a batch size of 32, and
5 training epochs.
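
These settings can be collected in a small configuration object, from which the number of optimization steps follows by arithmetic. The sketch below is only illustrative: the class name `TrainConfig` and the helper `steps_per_epoch` are hypothetical, and the training-set size of 8000 is taken from Table 2.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values reported in the experimental settings above.
    num_layers: int = 12
    num_attention_heads: int = 12
    hidden_size: int = 768
    max_seq_length: int = 128
    batch_size: int = 32
    epochs: int = 5


def steps_per_epoch(num_examples, batch_size):
    """Number of mini-batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


cfg = TrainConfig()
train_size = 8000  # training tweets, from Table 2
per_epoch = steps_per_epoch(train_size, cfg.batch_size)
total_steps = per_epoch * cfg.epochs
head_dim = cfg.hidden_size // cfg.num_attention_heads
print(per_epoch, total_steps, head_dim)  # 250 1250 64
```
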


4.3. Evaluation metrics 
To compare our model with the baseline and related work models, we used the accuracy and macro
F1-score metrics, computed as (9)-(12):


$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$


Where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false
negatives, respectively.


$$\mathrm{Macro\ F1} = \frac{2 \times MP \times MR}{MP + MR} \tag{10}$$
$$MP = \frac{1}{C} \sum_{j=1}^{C} MP_j \tag{11}$$
$$MR = \frac{1}{C} \sum_{j=1}^{C} MR_j \tag{12}$$


Where C denotes the number of classes, and $MP_j$ and $MR_j$ are the precision and recall for class j, respectively.
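
The metrics in (9)-(12) can be computed directly from label lists, as in the following minimal sketch. The function names and the toy labels are illustrative, and treating an undefined precision or recall (zero denominator) as 0 is an assumed convention.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) when `positive` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Equation (9): share of correctly classified examples."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Equations (10)-(12): harmonic mean of macro precision and macro recall."""
    precisions, recalls = [], []
    for c in classes:
        tp, fp, fn, _ = confusion_counts(y_true, y_pred, c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)  # assumed 0 if undefined
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    mp = sum(precisions) / len(classes)  # (11)
    mr = sum(recalls) / len(classes)     # (12)
    return 2 * mp * mr / (mp + mr) if mp + mr else 0.0  # (10)


# Toy labels: 1 = offensive, 0 = not offensive.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
print(round(accuracy(y_true, y_pred), 4), macro_f1(y_true, y_pred))  # 0.6667 0.625
```

Note that this definition combines the macro-averaged precision and recall, which can differ from the also common definition of macro F1 as the unweighted mean of per-class F1 scores.
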


4.4. Baseline and related work models 
The proposed model is compared with the following baseline models: 
— Majority baseline [24]: it is the baseline model provided by the SemEval task for the Arabic dataset. 
— BERT: we removed the BiGRU layer and fine-tuned the BERT model with a linear layer and a sigmoid
activation function.
— BERT-BiLSTM: we replaced the BiGRU layer with BiLSTM in the proposed model to examine its 
impact on the overall performance. 
In addition to these baselines, the results of the related work models discussed in section 2
are also included, namely AraBERTEmojisOUT [25], SVM and ValenceList + C-LSTM + Mult-BERT [26],
XLM-R [15], and BERT-CNN Kuisal [28].
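
As a sanity check on the majority baseline, its scores can be derived from the test-set class counts in Table 2 alone. The sketch below assumes the baseline predicts the most frequent class for every tweet and that the undefined precision of the never-predicted class is counted as 0; the function name is hypothetical.

```python
def majority_baseline_scores(n_offensive, n_not_offensive):
    """Accuracy and macro F1 (as in (9)-(12)) of always predicting the larger class."""
    total = n_offensive + n_not_offensive
    majority = max(n_offensive, n_not_offensive)
    accuracy = majority / total
    # Majority class: precision = majority / total, recall = 1.
    # Minority class: never predicted, so precision and recall are taken as 0.
    macro_precision = (majority / total + 0.0) / 2
    macro_recall = (1.0 + 0.0) / 2
    macro_f1 = 2 * macro_precision * macro_recall / (macro_precision + macro_recall)
    return accuracy, macro_f1


acc, f1 = majority_baseline_scores(402, 1598)  # test split counts from Table 2
print(round(acc * 100, 2), round(f1 * 100, 2))  # 79.9 44.41
```

Under these assumptions the macro F1-score comes out at about 44.41%, which agrees with the majority baseline value reported for this dataset.
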


5. RESULTS AND DISCUSSION
5.1. Selection of the BERT model 

There are many BERT-based models that have been implemented to support research in the Arabic 
language, as illustrated in Table 1. Thus, we first conducted various experiments to select the best BERT 
model for our proposed system. The experimental results are depicted in Table 3. 

It can be noticed that mBERT achieved the worst results, which can be explained by the fact that
this model was pre-trained on far less Arabic data than the monolingual models.
Among MARBERTv2, AraBERTv02, AraBERTv02-twitter, and Qarib, MARBERTv2 yielded the best
results, likely owing to its larger pre-training dataset (refer to Table 1). Furthermore, that dataset is a
combination of MSA and DA tweets, which aligns with the data evaluated in this study. Therefore, we use
MARBERTv2 to implement the evaluated models in this paper.


Table 3. Results of our proposed model using different Arabic BERT models
BERT model | Accuracy (%) | Macro F1-score (%)
mBERT | 91.5 | 85.18
CamelBERT | 93.4 | 89.36
Qarib | 94.7 | 91.39
AraBERTv02 | 94.30 | 90.83
AraBERTv02-twitter | 94.55 | 91.26
MARBERTv2 | 95.55 | 93.16


5.2. Effect of hyper-parameters 

To determine the optimal hyper-parameters for our proposed model, we conducted a sensitivity 
analysis by testing different configurations. We started by tuning the learning rate, which is crucial for weight 
control during back-propagation and affects training time until convergence. A high initial learning rate can 
cause unstable learning and divergence, while a low learning rate can result in slow convergence. We 
illustrated the results of testing different learning rates on the proposed model in Figure 4 and found that a 
learning rate of 5e-5 produced the best performance. Higher or lower learning rates reduced performance, so 
we used this learning rate for all implemented models in this study. 


[Bar chart: accuracy and macro F1-score (y-axis: scores) for learning rates 1e-5, 2e-5, 5e-5, and 8e-5 (x-axis: learning rate)]


Figure 4. Experimental results using different learning rate values 


The second hyper-parameter we tuned was the optimizer. We evaluated our model using
different optimization methods, including Adam, Adamax, RMSProp, and SGD, as shown in Figure 5. The
evaluation results showed that SGD performed poorly, with a macro F1-score below 46%, whereas
Adam and RMSProp achieved comparable results. Meanwhile, the Adamax optimizer outperformed the other
methods in terms of macro F1-score. Therefore, we used it to implement our proposed model.

Figure 5. Experimental results using different optimizers


Another crucial parameter in network design is the number of hidden units in the GRU layer, which 
significantly impacts the model’s training duration and complexity. Therefore, optimizing this parameter is 
essential to reduce model complexity and improve its execution performance and predictive capability. We
illustrated the results of the sensitivity analysis for this parameter in Figure 6 and found that using 32 or 64
hidden units yielded comparable results, while the best result was achieved with 128 hidden
units. However, when this number increased to 256 and 512, the overall performance decreased. Thus,
we set the number of hidden units in the GRU layer to 128 based on the best macro F1-score.
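
The one-parameter-at-a-time procedure described in this subsection can be sketched as a short loop. This is a hedged illustration, not the code used in the study: `tune_sequentially`, the starting defaults, and `toy_evaluate` are hypothetical, and `toy_evaluate` is a placeholder that returns made-up scores rather than measured results; in practice it would train the model and return the macro F1-score on held-out data. The candidate values are those named in this subsection and on the axis of Figure 4.

```python
def tune_sequentially(grids, defaults, evaluate):
    """Tune hyper-parameters one at a time, keeping the best value found so far.

    grids: dict mapping a parameter name to its candidate values (tuned in order).
    defaults: starting configuration.
    evaluate: callable taking a configuration dict and returning a score to maximize.
    """
    best = dict(defaults)
    for name, candidates in grids.items():
        scores = {}
        for value in candidates:
            trial = dict(best)
            trial[name] = value
            scores[value] = evaluate(trial)
        best[name] = max(scores, key=scores.get)
    return best


grids = {
    "learning_rate": [1e-5, 2e-5, 5e-5, 8e-5],
    "optimizer": ["adam", "adamax", "rmsprop", "sgd"],
    "hidden_units": [32, 64, 128, 256, 512],
}
defaults = {"learning_rate": 2e-5, "optimizer": "adam", "hidden_units": 64}  # hypothetical start


def toy_evaluate(cfg):
    """Placeholder score, NOT a measured result; stands in for train-and-validate."""
    score = -abs(cfg["learning_rate"] - 5e-5) * 1e4
    score -= 0.0 if cfg["optimizer"] == "adamax" else 1.0
    score -= abs(cfg["hidden_units"] - 128) / 128
    return score


print(tune_sequentially(grids, defaults, toy_evaluate))
```

Tuning parameters sequentially needs far fewer training runs than a full grid (here 13 instead of 80), at the cost of ignoring interactions between parameters.
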


[Bar chart: accuracy and macro F1-score]


Figure 6. Experimental results using different numbers of hidden units


5.3. Comparative analysis 

The main experimental results are presented in Table 4. They indicate that the proposed model
achieved an overall enhancement of more than 25% in terms of macro F1-score compared to the baseline model.
Moreover, our model outperforms related work models that used extra features with the BERT model
(i.e., AraBERTEmojisOUT) or adopted an ensemble structure (i.e., SVM and ValenceList + C-LSTM +
Mult-BERT) to resolve this task using the same dataset as in this study. Additionally, MARBERTv2-BiLSTM
and MARBERTv2-BiGRU achieved better results than the MARBERTv2 model fine-tuned using a
linear layer only. This indicates the effectiveness of combining the fine-tuned MARBERTv2 model with
more powerful neural network layers to further enhance the extracted semantic and contextual features.

Furthermore, our model significantly outperforms the BERT-CNN model, which can be explained by
the fact that capturing long-range dependencies in the data and considering the order of the words are crucial
for understanding the context and improving the classification performance. In addition, our model achieves
better results than BERT-GRU, which indicates the effectiveness of using bidirectional layers to encode
features from both the left and right sides for this task. Our model also outperforms the
BERT-BiLSTM model. This can be explained by the fact that GRU has a simpler architecture than LSTM,
potentially simplifying the training process.


Table 4. Main evaluation results. The results with "†" were retrieved from original papers
Model | Accuracy (%) | Macro F1-score (%)
Majority Baseline [24] | - | 44.41†
AraBERTEmojisOUT [25] | 93.97 | 90.17†
SVM and ValenceList + C-LSTM + Mult-BERT [26] | 93.85† | 90.16†
XLM-R [15] | - | 89.89†
BERT-CNN Kuisal [28] | - | 89.74
MARBERTv2-linear | 94.55 | 91.17
MARBERTv2-GRU | 94.55 | 91.66
MARBERTv2-BiLSTM | 95.0 | 92.41
MARBERTv2-BiGRU (ours) | 95.55 | 93.16


6. CONCLUSION 

In this paper, an enhanced BERT-based model is proposed to address the offensive language 
detection task on an Arabic reference dataset. The proposed model employs BERT to generate contextualized 
vector representations, followed by a BiGRU layer to further improve the extracted context and semantic 
features. The experimental results showed the effectiveness of our model compared to the baseline and 
related work models by achieving a macro F1-score of 93.16%. Additionally, the obtained results demonstrate the
effectiveness of combining BERT with bidirectional sequential layers to further improve its semantic
understanding.

Future work directions include evaluating our model on other Arabic NLP tasks, such as fine-grained
hate speech detection. Additionally, we intend to implement our model using pre-trained language
models other than BERT, such as the XLNet model. Moreover, we plan to adapt our model to handle the task
of offensive language detection on multilingual corpora. Finally, the dataset used in this study is
imbalanced. Thus, we also plan to investigate various methods for handling the class imbalance
issue and examine their impact on the overall performance of our model.